Analyze rain vs. ridership with statistical tests

Statistical tests can help us to obtain statistical rigor in comparing and evaluating data. Here I'm running a few tests with α level = 0.05 on ridership data of raini and non-raini days, to validate the distribution and statistical significance.

  • Normality
  • Equal means - Welch's t-test
  • Same populations - Mann–Whitney U test
First import libraries for mathematical operations and plotting, and initialize global variables.

In [24]:
import numpy as np
import pandas as pd
import scipy.stats as st

df = pd.read_csv('turnstile_weather_v2.csv', parse_dates=['datetime'])
ent_hr_rain = df[df["rain"] == 1]["ENTRIESn_hourly"]
ent_hr_no_rain = df[df["rain"] == 0]["ENTRIESn_hourly"]

Normality

From the histograms in section 0 we already knew that the distribution is right-skewed. To obtain statistic rigor, I'm double checking again. But because Shapiro–Wilk test may not be accurate for N > 5000, here I'm testing the normality with scipy.stats.normaltest which combines skew and kurtosis normality tests. Its p-value is a 2-tail chi squared probability for the hypothesis test. From the result p-values we can reject the null hypothesis that the data was drawn from a normal distribution for both rainy and non-rainy days. Both are non-normal.


In [25]:
# w, p = st.shapiro(df[(df['rain'] == 0) & (df['weekday'] == 1)]['ENTRIESn_hourly'])
# Got warning: p-value may not be accurate for N > 5000.
k2, p = st.normaltest(ent_hr_rain)
print 'RAIN: k2 = {0:.5f}, p-value = {1:.5f}'.format(k2, p)
k2, p = st.normaltest(ent_hr_no_rain)
print 'NO RAIN: k2 = {0:.5f}, p-value = {1:.5f}'.format(k2, p)


RAIN: k2 = 7995.34478, p-value = 0.00000
NO RAIN: k2 = 28009.07645, p-value = 0.00000

Equal means - Welch's t-test

In section 0 we already got the stats of hourly entries (every 4 hours) for rainy and non-rainy days:

Hourly entries - rain: Count = 9,585, Total = 19,440,259, Median = 939, Mean = 2,028.20
Hourly entries - no rain: Count = 33,064, Total = 61,020,916, Median = 893, Mean = 1,845.54

Median and mean are both a bit higher for rainy days. But are they statistically higher than non-rainy days? Though when we perform Welch's t-test we usually assume the data is normal, with large enough sample size, we can still run Welch's t-test to see if 2 data sets have the same sample means (μ1 = μ2). Welch's t-test doesn't assume equal sample size nor equal variance. Below the scipy.stats.ttest_ind function returns a 2-tail p-value, which is much smaller than p-critical 0.025. We can reject the ull hypothesis that 2 independent samples have identical average (expected) values. Here t-statistic is greater than 0 and 1-tail p-value/2 is much less than p-critical 0.025. We can confirm that the mean of entries is significantly higher when it rained (μ1 > μ2). This is interesting. From the bar chart in section 0 these 2 data sets don't look too different but they are statistically different.


In [26]:
t, p = st.ttest_ind(ent_hr_rain, ent_hr_no_rain, equal_var = False)
print '2-tail RAIN vs. NO RAIN: t = {0:,.5f}, p-value = {1:.10f}'.format(t, p)
print '1-tail RAIN vs. NO RAIN: p-value/2 = {0:.10f}'.format(p/2)


2-tail RAIN vs. NO RAIN: t = 5.04288, p-value = 0.0000004641
1-tail RAIN vs. NO RAIN: p-value/2 = 0.0000002321

One way to check how much the factor could affect the results is by calculating a R^2 = t^2 / (t^2 + Degree of Freedom). And the result shows that only about 0.06% of the differences in ridership was due to rain. Though the ridership was statistically higher for rainy days, some other factors are playing important roles too.


In [27]:
r_squared = t*t / (t*t + (9585 + 33064 - 2))
print 'R^2 = t^2 / (t^2 + Degree of Freedom) = {}'.format(r_squared)


R^2 = t^2 / (t^2 + Degree of Freedom) = 0.00059595073468

Same populations - Mann–Whitney U test

Since the data is non-normal, we can try Mann–Whitney U test which is a nonparametric test of the null hypothesis that two populations are the same. Mann–Whitney U test doesn't assume any underlying probability distribution. Below I'm running scipy.stats.mannwhitneyu function to see if these 2 data sets are from the same population. This function returns a 1-tail p-value.


In [28]:
u, p = st.mannwhitneyu(ent_hr_rain, ent_hr_no_rain)
print '1-tail RAIN vs. NO RAIN: u = {0:,.2f}, p-value = {1:.10f}'.format(u, p)


1-tail RAIN vs. NO RAIN: u = 153,635,120.50, p-value = nan

However, I can't get p-value directly using this function. So below I'm calculating p-value maually, following the steps for normal approximation provided in http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test#Normal_approximation. This test is often performed as a 2-tail test, thus, the research hypothesis indicates that the populations are not equal as opposed to specifying directionality. Below I'm converting it to 2-tail p-value which is much less than p-critical 0.05. And we can reject the null hypothesis that two populations are the same. Ridership of raining and non-raining days are different.


In [29]:
m_u = len(ent_hr_rain) * len(ent_hr_no_rain) / 2
print "m_u = {0:,.1f}".format(m_u)
sigma_u = np.sqrt(len(ent_hr_rain) * len(ent_hr_no_rain) * (len(ent_hr_rain) + len(ent_hr_no_rain) + 1) / 12)
print "sigma_u = {0:,.5f}".format(sigma_u)
z = (u - m_u) / sigma_u
print "z = {0:,.5f}".format(z)
p = 2 * st.norm.cdf(z)
print '2-tail RAIN vs. NO RAIN: u = {0:,.2f}, 2*p-value = {1:.10f}'.format(u, p)


m_u = 158,459,220.0
sigma_u = 1,061,310.96079
z = -4.54542
2-tail RAIN vs. NO RAIN: u = 153,635,120.50, 2*p-value = 0.0000054827